Todo

1 Introduction

1.1 Problem description

When dealing with modern large data sets, the observations’ dimensionality often becomes a problem regarding the interpretability of the data set, or for its vizualisation, especially during the data exploration phase. The Principal Component Analysis (PCA) is a statistical technique used to reduce the dimension of the data set, while trying to keep as much information as possible from the original data set (i.e. variance). An example of PCA usage is bringing a high-dimension data set to \(\mathbb{R}^2\) or \(\mathbb{R}^3\) to vizualise its datapoints and afterwards identifying clusters. It does so by computing Principal components from the data set and then projecting datapoints to this new basis.

Mathematically speaking, we can define principal components as an orthonormal basis of unit vectors which sequentially capture the most variance in the data. Practically, the PCs can be found with the following optimization problems for a given centered Gaussian data set \(\mathbf x_1,\ldots,\mathbf x_N \in \mathbb{R}^p\): \[ \begin{split} v_1 &= \underset{v, \| v \|=1}{\mathrm{arg\,max}} \mbox{ } v^\top \widehat{\boldsymbol{\Sigma}} v \\ v_2 &= \underset{v, \| v \|=1, v^\top v_1 = 0}{\mathrm{arg\,max}} v^\top \widehat{\boldsymbol{\Sigma}} v \\ &\;\;\vdots \end{split} \]

with \(\widehat{\Sigma} = N^{-1}\sum_{n=1}^N \mathbf x_n \mathbf x_n^\top \in \mathbb{R}^{p\times p}\) being the empirical covariance matrix of the data set. More information about Principal component analysis can be found in Jolliffe, 2002 or directly on Wikipedia.

Example of 2 PCs on a given data set in $R^2$

Figure 1.1: Example of 2 PCs on a given data set in \(R^2\)

This project focuses on finding the optimal number of principal components (\(r\)), which dictates a trade-off between the model’s complexity and its interpretability. In fact, higher the \(r\), higher the explained variance, but less interpretable the data set. To optimally choose the number of PCs, several techniques have been created, such as the method of percentage of variance explained, the scree-plot method or Cross-validation for PCA.

In this report, one will first find a presentation of each Cross-Validation method for PCA with an initial description, followed by a pseudo-code in Cross-validation on PCA methods. Then, those methods will be ran on simulated data sets with different parameters in Simulation and comparison of CV methods. Furthermore, the percentage of variance explained and the scree-plots methods will be described and tested in Other PCs number selection methods. Finally, the conclusion will wrap up the take-home messages as well as next steps which could be conducted regarding this subject.

2 Cross-validation on PCA methods

2.1 Naïve Approach

A widely used approach for Cross-validation in PCA is a method that has the right intentions but fails at a critical point: it does not take into account any variance-bias tradeoff, meaning that the optimal solution will always be \(r = R\). We start with the usual conventions by denoting \(X \in \mathbb{R}^{N \times p}\) our data with \(N\) observations and \(p\) variables, then as presented in Notes 06 - T. Masak, below is the pseudo-code of the Naïve approach:

  • split data \(\mathbf{X}\) into \(K\) folds \(J_1,\ldots,J_K\) (row indices)
  • for \(k=1,\ldots,K\):
    • solve \(\widehat{\mathbf{L}} = \underset{\mathrm{rank}(\mathbf L) = r}{\mathrm{arg\,min}} \|\mathbf X[J_k^c,:] -\mathbf L \|_2^2\)
    • calculate \(Err_k(r) = \frac{1}{|J_k|}\sum_{m \in J_k} \| x^m - P_{\widehat{\mathbf{L}}} x^m \|_2^2\)
  • end for
  • choose \(\widehat{r} = \underset{r}{\mathrm{arg\,min}} \sum_{k=1}^K Err_k(r)\)

A numerical method to solve the least square problem in the top of the algorithm would be to compute the SVD decomposition of \(\mathbf X[J_k^c,:]\) and truncate the rank to \(r\) by setting the \(p-r\) smallest singular values to zero. To estimate \(\mathbf{x^{m}}, \forall m \in J_k\) with a lower dimensional version, we project it with the matrix \(P_{\mathbf{\widehat{L}}}\) which can be computed via the the R built-in function prcomp() the returns among other objects a basis for the dimension of \(\widehat{\mathbf L}\). The promised issue occurs when computing the error measure without further to-do. Thus the error obviously shrinks when \(\widehat{\mathbf L}\) is allowed to have higher dimensions and converges to \(0\) when its number of dimensions equals those of the initial data set \(\mathbf{X}\), as the SVD will be a nearly exact representation of the matrix \(\mathbf{X}\).

2.2 Wrong PCA Improved

This method corrects the Naïve Approach by introducing a Bias-Variance trade-off for the \(Err_k\) function. Let again \(\mathbf{X} \in \mathbb{R}^{N \times p}\) be our Multivariate Gaussian data set with \(N\) observations and \(p\) variables, mean \(\mu\) and covariance \(\Sigma\). As described in Notes 06 - T. Masak, we implement this method as below :

  • split data into \(K\) folds \(J_1,\ldots,J_K\) (row indices)
  • for \(k=1,\ldots,K\):
    • Compute \(\hat{\mu}\) and \(\hat{\Sigma} := \hat{\Sigma}_r\) (the empirical estimator truncated to rank r) on \(\mathbf{X}\setminus J_k\).
    • Split data points \(\mathbf{x}^m\in \mathbb{R}^{p}, m \in J_k\) into a “missing” part \(\mathbf{x}_{miss}\) that will be used for validation and an “observed” part \(\mathbf{x}_{obs}\) in order to estimate its counterpart
    • Predict the missing part from the observed part using the \(\hat{\mu}\) and \(\hat{\Sigma}_r\)
    • That is compute \(\hat{\mathbf{X}}^m_{miss} = \mathbb{E}_{\hat{\mu},\hat{\Sigma}_r}[\mathbf{X}^m_{miss}|\mathbf{X}^m_{obs}=\mathbf{x}^m_{obs}] = \hat{\mu}_{miss} + \hat{\Sigma}_{miss,obs}\hat{\Sigma}^\dagger_{obs}(\mathbf{X}^m_{obs}-\hat{\mu}_{obs})\), where \(\hat{\Sigma}^\dagger_{obs}\) is the pseudoinverse of \(\hat{\Sigma}_{obs,obs}\) and \(\hat{\mu}_{miss},\hat{\mu}_{obs}\) are \(\hat{\mu}\) split according to the “missed” and “observed” variable indices.
  • end for
  • choose \(\widehat{r} = \underset{r}{\mathrm{arg\,min}} \sum_{k=1}^K Err_k(r)\), where for instance \(Err_k(r)\) can be considered as the MSE for every fold \(J_k\), e.g. \(Err_k(r)=\frac{1}{|J_k|}\sum_{m\in J_k} \|\hat{\mathbf{X}}^m_{miss}-\mathbf{X}^m_{miss}\|^2_2\)

It is important to emphasize some points of this aforementioned algorithm. When we truncate \(\hat{\Sigma}\) to \(\hat{\Sigma}_r\) in every fold, we do it by eigenvalue decomposition, then store \(\alpha = \sum_{i=r+1}^p \lambda_i\), and then set to zero those \(p-r\) smallest eigenvalues. Afterwards, we compute \(\hat{\Sigma}_r = U\tilde{V}U^T\) and then add \(\alpha / p\) to its diagonal. This is done to smooth the drop of eigenvalues when truncating the covariance estimator and avoid undesirable results of our algorithms.

Hence, we reduce the rank of the covariance of \(\mathbf{X}\), but we preserve its dimension which allows us to estimate \(\mathbf{x}^m_{miss}\) in the later part of the algorithm. When splitting the observations in every iteration of the for loop, it naturally comes to mind if we should always take the same or a different (random) split. The first possibility was chosen as that additional random factor would be negligible by the SLLN. Nevertheless, it makes sense to keep two equally sized parts in order to have a proper estimation basis without overestimating. This also gives us a good understanding whether our estimation method will work in harsh conditions, i.e. if half of the data is missing. When estimating \(\hat{\mathbf{x}}^m_{miss}\), the linear estimator is used as it emerges from the law of our data set \(\mathbf{X}\). Hence, it is only applicable in this form to Gaussian distributed observations \(\mathbf{x} \in \mathbb{R}^p\). The estimation also makes use of the pseudoinverse of \(\hat{\Sigma}_{obs,obs}\) denoted by \(\hat{\Sigma}^\dagger_{obs}\). We use this convention in order to avoid singularity issues when inverting a block of the truncated covariance estimator. The pseudoinverse is easily callable in R by the function ginv() embedded in the MASS package.

2.3 Missing data approach

Here it is assumed that the rows \(\mathbf{X}_{n} \in \mathbb{R}^p, n=1,…,N\) of the data matrix \(\mathbf{X}=(\mathbf{X}_{n,j})^{N,p}_{n,j=1}\) are i.i.d. realizations of a multivariate Gaussian random variable of mean \(\mu \in \mathbb{R}^p\) and a covariance \(\Sigma \in \mathbb{R}^{p \times p}\). We will assume that rank\((\Sigma)=r\) for \(r=1,…,R \le p\), and compare how well different values of \(r\) fit the data. As described in Notes 06 - T. Masak, this approach is based on computing the estimators \(\hat{\Sigma}\) and \(\hat{\mu}\) based on observed data by the EM algorithm while the set of missed and observed data is randomly selected at the beginning of the algorithm. Then, it performs more or less the same steps as the previous algorithm to find the optimal number of PCs. Let us now describe the EM algorithm used in our case and then describe the Missing data approach algorithm.

Let \(\tilde{\mathbf{X}}\) be the data set containing missing values randomly placed on the original \(N \times p\) data set. The EM algorithm works as follows :

  • Initialize \(\hat{\mu} = \frac{1}{N}\sum_{n=1}^N(\tilde{\mathbf{X}}^{obs}_{n})\) and \(\hat{\Sigma} = I_{p\times p}\), where \(\tilde{\mathbf{X}}_{obs}\) is the data set restricted to observed values.
  • Set \(\hat{\Sigma}^{old} = C\hat{\Sigma}\), and \(tol = 0.01\) for a given constant \(C\) big enough to enter the while loop (e.g. \(C = 600\))
  • while \((||\hat{\Sigma}-\hat{\Sigma}^{old}||_F > tol||\hat{\Sigma}^{old}||_F)\) where \(||.||_F\) is the Frobenius norm
    • Set \(\hat{\Sigma}^{old} = \hat{\Sigma}\)
    • Estimate \(\tilde{\mathbf{X}}_{miss}^{n}\) by a linear estimator : \(\tilde{\mathbf{X}}_{miss}^n = \hat{\mu}_{miss} + \hat{\Sigma}_{miss,obs} \hat{\Sigma}_{obs}^{\dagger}(\tilde{\mathbf{X}}_{obs}^n-\hat{\mu}_{obs})\)
    • Set \(\hat{\mu} =\frac{1}{N}\sum_{n=1}^N(\tilde{\mathbf{X}}_{n})\), \(\hat{\Sigma} = \frac{1}{N-1}\sum_{n=1}^N (\tilde{\mathbf{X}}_i-\hat{\mu})(\tilde{\mathbf{X}}_i-\hat{\mu})^T\), i.e. the two MLE estimators
  • end while
  • return the estimated \(\hat{\mu}, \hat{\Sigma}\)

This method slightly differs from the one presented in Notes 05 - T. Masak, mainly due to poor results when estimating \(\hat{\Sigma}\) and light confusion about notations used in the aforementioned implementation. Another matrix approach for the EM algorithm was also tried from Machine Learning, Murphy which gave good results for \(\hat{\mu}\) but was also unsuccessful when predicting \(\hat{\Sigma}\) when \(r\) becomes too large.

Below is the pseudo-code for the Missing Data method :

  • split the data set in \(K\) folds \(J_1,\ldots,J_K\) (row indices)
  • Compute the bivariate set of missed data \(\Omega\)
  • for \(k=1,\ldots,K\):
    • Compute \(\hat{\mu}\) and \(\hat{\Sigma}\) by the EM algorithm with the data set \(\mathbf{X}\) and missing indices \(\Omega\)
    • Truncate \(\hat{\Sigma}\) to a rank \(r\) matrix \(\hat{\Sigma}_r\) by eigenvalue decomposition.
    • Reuse the missed data \(\Omega\) to split the data points \(\mathbf{x}^m\in \mathbb{R}^{p}, m \in J_k\) into a “missing” part \(\mathbf{x}_{miss}\) that will be used for validation and an “observed” part \(\mathbf{x}_{obs}\) in order to estimate its counterpart
    • Estimate the missing part from the observed part with a linear estimator using the \(\hat{\mu}\) and \(\hat{\Sigma}_r\) as previously described in the last CV method.
  • end for
  • choose \(\widehat{r} = \underset{r}{\mathrm{arg\,min}} \sum_{k=1}^K Err_k(r)\), where for instance \(Err_k(r)\) can be considered as the MSE for every fold \(J_k\), e.g. \(Err_k(r)=\frac{1}{|J_k|}\sum_{m\in J_k} \|\hat{\mathbf{X}}^m_{miss}-\mathbf{X}^m_{miss}\|^2_2\)

As one can observe, this method is pretty similar to the previous one, except for the computation of the missing values method and the parameters estimation. Here, the missing values are randomly selected across the data set and can be seen as holes in the data set, while for the previous CV method, whole columns (aka variables) were taken as missed. Also, as we are now dealing with holes in the data set, it was not guaranteed that there would be a fully defined estimator for \(\Sigma\), compared to the other method for which we could easily select parts of the observed and missed covariance. This forced us to use the EM algorithm for this CV method, as opposed to the other one where a simple mean() and cov() function call did the job.

2.4 Matrix completion method

This final methods is somehow a bit more general, as we do not need to assume any probability distribution of the data \(\mathbf{X} \in \mathbb{R}^{N\times p}\), but just that \(\mathbf{X}\) has a finite rank. Indeed, for a given data matrix \(\mathbf{X}\), we assume that a portion of this matrix is missed and that the remaining elements are observed. Let \(\Omega\) be a bivariate index set for the observed data in \(\mathbf{X}\). The matrix completion method consists of finding a matrix \(\mathbf{M}\) such that :

\[ \mathrm{arg \, min}_{\mathbf{M}=(\mathbf{M}_{ij})_{i,j=1}^{N \times p}} \sum_{(i,j) \notin \Omega} (\mathbf{X}_{ij} - \mathbf{M}_{ij})^2 \quad \mathrm{s.t.} \quad \mathrm{rank}(\mathbf M)=r \]

This optimization method gives us the best matrix \(\mathbf M\) with rank \(r\) w.r.t least-square error on the missed data. However, finding the solution of such a optimization problem is \(NP\)-hard, so one can use the following iterative algorithm to find local best candidates for the matrix \(\mathbb M\) :

  • Select an initial candidate \(\mathbf{M}^{(0)}\), here : \(M^{(0)}\) is a set of multivariate Guassian random variables with mean \(\hat{\mu}_{obs}\) and covariance \(I_{p\times p}\) where \(\hat{\mu}_{obs}\) is the empirical mean computed for each column with its respective observed entries of the matrix.
  • Set \(l=0\), \(tol = 10^{-4}\)
  • do
    • \(l = l+1\)
    • Compute the SVD of \(\mathbf{M}^{(l-1)}\), i.e. \(\mathbf{M}^{(l-1)} = UDV^T\)
    • Copy \(D\) into \(\widetilde{D}\) and set the \(p-r\) diagonal elements of \(\widetilde{D}\) to zero (truncating to rank \(r\))
    • Set \(\widetilde{\mathbf{M}} = U\widetilde{D}V^T\), the new matrix of rank \(r\).
    • Set \(\mathbf{M}^{(l)}\)as : \[ \mathbf{M}^{(l)} = \begin{cases} \mathbf{X}_{ij} \quad \text{for } (i,j) \in \Omega \\ \widetilde{\mathbf{M}}_{i,j} \quad \text{for } (i,j) \notin \Omega \end{cases} \]
  • while \(\|M^{(l)}-M^{(l-1)}\|_F> tol\|M^{(l)}\|_F\) and \(l < 1000\):
  • Store the error \(\sum_{(i,j) \in \Omega} (\mathbf{X}_{ij} - M^{(l)}_{ij})^2\) and then repeat this experiment for each value of \(r\) from \(1\) to \(p\), then pick the \(r\) which gives the lowest error.

We decided multiply the initial \(\|M^{(l)}-M^{(l-1)}\|_F\) stopping criteria by \(\|M^{(l)}\|_F\) to take into account the magnitude of the elements in \(M\), and added the condition \(l<1000\) in case of the algorithm would not converge to the desired tolerance in a reasonable amount of time. In other words, the algorithm stops and returns the matrix \(M^{(l)}\) if there was a change of less than \(0.01\%\) from the previous matrix \(M^{(l-1)}\), or if we go above \(1000\) iterations.

2.5 KDE modified approach

This approach is a modification of the KDE method from Notes 06 - T. Masak. Let, as usual, \(\mathbf{X}\) be a multivariate Gaussian data set of size \(N \times p\) The goal is to minimize w.r.t \(r\) the following quantity :

\[ \mathbb{E} \| \widehat{C}_r - C \|_F^2 = \mathbb{E}\| \widehat{C}_r \|_F - 2 \mathbb{E} \langle \widehat{C}_r, C \rangle + \| C \|_F^2 \]

with \(\widehat{C}_r\) being the truncated covariance estimator of rank \(r\), \(C\) being the covariance matrix and \(\|.\|_F\) being the Frobenius norm while \(\langle X, Y \rangle\) represents the sum of all elements from the Hadamard product of the two matrices \(X,Y\), or in mathematical terms : \(\langle X, Y \rangle := \sum_{i,j}X_{i,j}Y_{i,j}\). Assuming that the mean of \(\mathbf{X}\) is zero, we get: \[ \mathbb{E} \langle \widehat{C}_r^{(-n)}, \mathbf{X}_n \mathbf{X}_n^\top \rangle = \langle \underbrace{\mathbb{E}[ \widehat{C}_r^{(-n)}}_{\approx \mathbb{E} \widehat{C}_r}], \underbrace{\mathbb{E} [\mathbf{X}_n \mathbf{X}_n^\top}_{ = C}] \rangle \approx \mathbb{E} \langle \widehat{C}_r , \mathbf{X}_n \mathbf{X}_n^\top \rangle \]

Therefore, since the truncated covariance estimator is not linear we obtain an approximation, which we will plug back into the initial quantity we wish to minimize. By differentiating the quantity w.r.t \(r\), we obtain the desired ranking as it will be the only value depending on \(r\). Overall, we implement this method as follows for every rank \(r\):

  • Estimate Covariance \(\widehat{C}\) from given data \(\mathbf{X}\)
  • Decompose \(\widehat{C}\) and set \(p-r\) smallest eigenvalues to \(0\), then rebuild it to reduce its rank to \(r\) (\(\widehat{C}_r\))
  • Set \(Err_r = \|\widehat{C}_r\|_F\)
  • for \(j\) in \(1,\dots,N\):
    • Compute \(\widehat{C}_r^{(-j)}\) by estimating the covariance of \(\mathbf{X}^{(-j)}\), \(j\) being the missing observation in \(\mathbf{X}\) then truncate it to rank \(r\).
    • Subtract \(\frac{2}{N}\langle \widehat{C}_r^{(-j)}, \mathbf{X}_j\mathbf{X}_j^T\rangle\) from \(Err_r\)
  • end
  • Choose \(r^* = \underset{r}{\mathrm{arg\,min}} \quad Err_r\) according to the minimal error \(Err_r\)

Stepping into the for-loop we realize that it is pretty similar to a Leave-One-Out Cross-validation. In this Method we are trading off high computational cost against a stable solution. As shown in the simulations, we will often get negative values for \(Err_r\), as one of its components cannot be computed (\(\|C\|_F^2\)) but was not necessary to find the optimal \(r\).

3 Simulation and comparison of CV methods

3.1 Simulation data set description

As described in Cross-validation on PCA methods, one is now equipped with multiple approaches to recover the best amount of dimensions needed to represent data “accurately” of a data set. Intuitively, every method must have its different advantages and drawbacks. To get a better feeling on how the Cross-Validation methods behave in different circumstances, one can run a simulation study on a variety of synthesized data sets. This allows us to discover the limits of the different methods used and to conclude and recommend different Algorithms with respect to its underlying data. By construction of our Algorithms, every single one of them besides the Naïve Approach, the Matrix completion method is only applicable to Multivariate Gaussian distributed data. Therefore we restrict ourselves to this special distribution and create corresponding data in order to get meaningful and comparable results. All in all, \(9\) different data sets were considered and the results of the five algorithms were compared. Furthermore, we only consider centered Gaussian Multivariate Random variables and create three base data sets, on which we infer three kinds of different noises. The first data set is a standard Multivariate Gaussian data set while the second and third base data sets will rely on a random and structured covariance matrix respectively. We denote our base data sets as \(D^{i}_0, D^{i}_1\) and \(D^{i}_2\) with \(i \in \{1,\dots,3\}\), corresponding to each type of noise. It consists of data \(\mathbf{X}=\{\mathbf{X}^1,\dots,\mathbf{X}^N\}\) and noise \(\epsilon^{i}\). The amount of different parameters make it difficult to write get a single big picture of the problem, which is the reason we implemented a Shiny App that lets the reader tune the parameters as he pleases. We strongly recommend to check it out and test the limits of the implemented cross-validation methods.

\[ D^{i}_0 = \mathbf{X} + \epsilon^{i}_p, \text{ where } \mathbf{X}^j \sim \mathcal{N}(0, I_{p\times p}), \forall j = 1,\dots,N\\ D^{i}_1 = \mathbf{X} + \epsilon^{i}_p, \text{ where } \mathbf{X}^j \sim \mathcal{N}(0, \Sigma_1), \forall j = 1,\dots,N \text{ and } \Sigma_1=MM^{T}, (M_{ij})_{1\leq i,j\leq p} \stackrel{\text{iid}}{\sim} \mathcal{N}(0,1)\\ D^{i}_2 = \mathbf{X} + \epsilon^{i}_p, \text{ where } \mathbf{X}^j \sim \mathcal{N}(0, \Sigma_2), \forall j = 1,\dots,n \text{ and } \Sigma_2=MM^{T}, (M_{ij})_{1\leq i,j\leq p}=\frac{(i-1)p+j}{p} \]

For the covariance \(\Sigma_2\) of \(D^{i}_2\) we will get a structured matrix that will emphasize the meaningful position of the last variables. For clarity purposes, we will give an example for \(p=3\):

\[ M = \frac{1}{9}\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ \end{pmatrix} \implies \Sigma_2 = \frac{1}{81}\begin{pmatrix} 66 & 78 & 90 \\ 78 & 93 & 108 \\ 90 & 108 & 126 \\ \end{pmatrix} \]

In order to have valid comparisons across all of our methods and validate their correctness, we fix the rank of \(\mathbf{X}\) as follows. First, we realize that by the random construction of \(\mathbf{X}\), we can assume that it has full rank \(p\). Second, we truncate it to a specific rank \(r\), which should be the recommended number of PCs of our Algorithms. Due to asymmetry of \(\mathbf{X}\), we do so by performing a singular value decomposition of \(\mathbf{X}\) with the built-in svd() function. We will set its \(p-r\) smallest singular values to \(0\); which enables us to control its rank. Please note that \(D_2\) is already of rank \(2\) before adding noise. Having adapted \(\mathbf{X}\), we infer either uniform noise, differing noise.

\[ \epsilon^{1}_p \stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma^2I_{p\times p}), \sigma=0.02\text{V}_{SVD}^r \text{ where } \text{V}_{SVD}^r \text{ is the r-th singular value of } \mathbf{X}\\ \epsilon^{2}_p\stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma^2 I_{p\times p}), \sigma=0.001\text{V}_{SVD}^r \\ \epsilon^{3}_p\stackrel{\text{iid}}{\sim}\mathcal{N}(0, U), U_{ii}\stackrel{\text{iid}}{\sim} \mathcal{U}([0,0.005\text{V}_{SVD}^r]) \text{ and } U_{ij}=0, \text{for } i\neq j \\ \]

We run our simulation with \(n=100\) observations per data set and \(p=8\) variables. We truncate our data set to rank \(r=3\) and repeat our cross-validation \(sim=5\) times on the seed \(1312\). The methods using classical cross-validation will loop over \(K=5\) folds. In order to take into consideration the magnitude of our data set, we chose to multiply the noise’s variance by the \(r-th\) singular value of \(\mathbf{X}\).

3.2 Comparison with respect to noise

For this analysis, we focus on \(D_0 \in \mathbb{R}^{N \times p}\) and will vary the noise \(\epsilon^i_p\) for \(i \in [1,3]\). Hereafter are the errors for each CV method described in Section 2.

Comparison of CV methods on base 0 data set with different noises as indicated by Noise°1,2 and 3

Figure 3.1: Comparison of CV methods on base 0 data set with different noises as indicated by Noise°1,2 and 3

Fixing our data set to be \(D_0\) and applying our CV Methods, we get very different results for each algorithm. Please note that the \(y\)-axis in in logscale in order to get meaningful insights about the errors, especially for the Missing Data approach. First, the error of the Naïve Approach decreases continuously and achieves its best estimation when allowing the method to make use of all \(8\) variables, as theoretically expected.

The Wrong PCA Method Improved method will however counter this behavior and the error will rise after hitting the optimal \(r=3\) value. Overall this would lead the Wrong PCA Improved Method to favor rank \(3\) as estimate for the rank of our modified data set. In general we can still see that the Wrong PCA Improved has difficulties to extract the exact amount of variables, especially when the added noise to the data set is constantly high, as the high curve in yellow is less steep than the blue one with low noise. We also note that for the high noise curve, there is a tendency to revert back when \(r\) increases, which might cause problems as the biggest value \(r\) might be returned.

The Missing data approach differs in its overall behavior from the previous to methods. While the error only decreases slightly in the first and the second variable, it drops noticeably for \(r=3\) in our Simulation Study especially for low and differing noise. Exceeding rank \(3\), there is a strong increase in the error due to the \(\Sigma\) estimation of the EM-Algorithm diverging from the expected value. A possible reason for this divergence is the choice of initial covariance matrix (in our case the Identity matrix). Here, we conclude the this approach also finds the correct dimension, but one has to be careful of the amount of noise added as this challenges the algorithm quite a bit.

The error measure of the Matrix completion method seems to be a mix of the error of the Wrong PCA Improved and the Missing Data approach. We observe a slight increase in the first two variables, followed by a strong error drop on variable \(3\) before it starts to increase again. The increase after variable \(3\) is clearly stronger than at the beginning.

As for the KDE Modified Approach, it will be analysed at the end of this section due to similar behaviors with respect to different data sets.

After this initial analysis, one would like to examine how the CV methods generalize to other data sets, i.e. moving from the identity matrix as covariance to a random matrix. Hereafter are the results for \(D_1\), being the data set with a random covariance matrix, as defined in the Simulation data set description

Comparison of CV methods on base 1 data set with different noises as indicated by Noise°1,2 and 3

Figure 3.2: Comparison of CV methods on base 1 data set with different noises as indicated by Noise°1,2 and 3

The [Naive Approach] seems to preserve the same shape and error size as on \(D_0\). The Wrong PCA Improved method slightly changes as it still drops in within the first three variables, but stabilizes more strongly afterwards. Again we see how the total noise added accounts for the different stabilization errors. For the Missing Data Approach error we observe a strong similarity to the previous analysis. Its error starts to decrease until variable \(3\) and then spikes up to stabilize. The only noticable difference is, that the total amount of noise added plays now a stronger criterion on where the curve converges. With the uniform low noise converging with the lowest error. In the Matrix Completion method we now observe that the algorithms doesn’t screw up its prediction with high uniform noise, but does this time for differing noise. For high noise the method favors \(3\) variables, where as it chooses \(2\) variables for differing noise. For a uniform low noise it seems to work well by, having a sharp decrease in error rank \(3\) and then increasing again.

We now examine to which extent our algorithms change if we consider a structured covariance \(\Sigma_2\), with increasing impact on the last variables.

The Naïve Approach seems to preserve the same shape and error size as on \(D_0\). The Wrong PCA Improved method slightly changes as it still drops in within the first three variables, but stabilizes more strongly afterwards especially for Noise n°3. Again we see how the total noise added accounts for the different stabilization error. Interestingly, the Missing Data Approach and the Matrix Completion Method aren’t able to return more precise estimation with a higher amount of variables and actually increase in error size. Both have kind of the same shape and error size, whereas the latter one seems to be a smoother version of the first one. Also the error curve sticks closely together for the uniform noise in the Missing Data Approach, whereas the differing noise starts beneath the uniform noises but ends being less accurate.

Comparison of CV methods on base 2 data set with different noises as indicated by Noise°1,2 and 3

Figure 3.3: Comparison of CV methods on base 2 data set with different noises as indicated by Noise°1,2 and 3

The Wrong PCA has behaves again the same regardless of the noise added. The first observation also applies for the Wrong PCA improved. The only subtlety we want to emphasize on, is that it seems once again to be sensitive with the total amount of noise added. A stronger decrease in the first two variables for the uniform and differing noise is observable. Interestingly, the Missing Data Approach and the Matrix Completion Method don’t seem to handle the kind of structure very well. In fact the algorithms aren’t able to return more precise estimation with a higher amount of variables and actually increase in error size. Both have kind of the same shape and error size, whereas the latter one seems to be a smoother version of the first one. Also the error curve sticks closely together for the uniform noise in the Missing Data Approach, whereas the differing noise starts beneath the other two error curves but then exceed them after rank \(5\). The adhesivity is somewhat stronger for the error Matrix completion error as the error curve is almost identical regardless of the noise added.

KDE Modified Approach for the 3 data sets with different noises as indicated by Noise°1,2 and 3

Figure 3.4: KDE Modified Approach for the 3 data sets with different noises as indicated by Noise°1,2 and 3

KDE Approach for base 0 data set with different noises as indicated by Noise°1,2 and 3

Figure 3.5: KDE Approach for base 0 data set with different noises as indicated by Noise°1,2 and 3

For the KDE Modified Approach, we observe in general that for every variable until the third one it drops very strongly. Having reached rank \(3\) the quality of its estimation seem to stay the same and correspondingly the error. In case of the second data set we observe the same behaviour but instead of stabilizing after rank \(3\) it does that after reaching variable \(2\) by construction of our data set.

3.3 Comparison across different covariance matrices

We now fix the error to be the third one and analyze how our different CV methods behave on the different kinds of base data sets, e.g. \(D_0^3, D_1^3\) and \(D_2^3\). We only consider the third kind of noise as we consider it to be a mix between the uniform high and low noise \(\epsilon^1, \epsilon^2\) and the plots only differ negligibly but stay consistent in their fundamental shape for the non-mentioned noise. We also want to mention that the following plots are the same plots as before, only rearranged in order to have a better representation on how our CV Methods behave with respect to the different base data sets.

Comparison of CV methods on different data set for Noise n°3 (differing noise)

Figure 3.6: Comparison of CV methods on different data set for Noise n°3 (differing noise)

Unsurprisingly, the shape of the Naïve Approach doesn’t change at all. However, we are able to observe that the error level is slightly different until variable \(3\) but then really sticks together.

The [Artificially turn the unsupervised problem into a supervised one] again decreases until the true value of the data set and then stabilized. The error is the lowest for the data set with a structured data set, then for the one with an Identity Matrix as covariance and the worst being the one with a random covariance matrix.

The Missing data approach gives us a similar shape for \(D_0\) and \(D_1\) but a fundamentally different one for \(D_2\). Whereas the error again decreases until rank \(3\) for \(D_0\) and \(D_1\), it doesn’t seem to be the case for \(D_2\). This approach doesn’t find the true rank for \(D_2\) and keeps increasing continuously from the beginning until the last three variables. For \(D_0\) and \(D_1\) we notice the same sharp increase in error size when exceeding the true rank of the data set.

For the Matrix completion method we get similar characteristics for \(D_0\) and \(D_1\), whereas the method fails at \(D_2\). Evaluating it on \(D_0\) and \(D_1\) we observe a slight increase in the fist two variables but then a noticeable error decrease when allowing the method to truncate to the true rank of data data set. Looking carefully at the plot we see that the method actually overshoots the true rank for \(D_1\) as it get the best estimations at the fourth variable. Exceeding the third and fourth variable respectively, the error reincreases but stabilizes beneath the initial error size of variable \(1\) and \(2\). For \(D_2\) we monitor a convex, square root like function favoring dimension \(1\) to be the true rank of the data set even though we would have expected it to be \(2\). \(D_1\) and \(D_2\) seem to be a source of trouble as they don’t allow us to conclude the true rank based on the error development within the variables.

3.4 Comparison with respect to initial rank truncation

In this last section about simulations, the focus will be on the initial truncating rank of the data set \(\mathbf{X}\). As described earlier, we run the CV methods on a data set which we artificially truncate via SVD to assess the correctness of the output of each method. Please note that the [Wrong PCA] method has been omitted as its behavior is known (will always converge to \(r = 8\)). The graphs below plot the errors of each method on \(D_0^2\), being the 0 basis data set with low noise, while each line represents the initial truncated rank \(r \in [0,8]\).

Comparison of CV methods on base 0 data set (Noise n°2) and different initial rank truncation

Figure 3.7: Comparison of CV methods on base 0 data set (Noise n°2) and different initial rank truncation

The Wrong PCA Improved method generalizes well for different values of initial rank, as we can observe the dive at the right rank of \(\mathbf{X}\). As for the Missing Data approach, we observe that for \(r > 3\), the method does not give good results anymore. One could think that it is due to \(D_0^2\) being of rank \(3\) and therefore doing the SVD truncation would not change anything to the initial data set, but after careful verification, the rank of \(D_0\) with low noise is of rank \(8\), meaning that this explanation is not plausible, and that the algorithm therefore performs poorly. However, one reason which was already highlighted in the previous sections is the poor behaviour of the \(\hat{\Sigma}\) from the EM algorithm implementation. For the Matrix completion approach, we observe good results for every rank \(r\), until \(r=6\) where the distinction becomes somewhat blurrier. Lastly, the [KDE Approach] gives us exact results for each rank.

As for \(D_1\) with a low noise, the same conclusions from \(D_0\) can be drawn, and will therefore not be included in the report. As \(D_2\) is of rank \(2\), studying the truncation of the rank will not give interesting results either.

4 Other PCs number selection methods

4.1 Percentage of variance explained method

The percentage of variance explained method is based on the data set’s covariance matrix, and its eigenvalues and the total variability (sum of diagonal entries of the covariance matrix). When computing the PCs for a given data set \(\mathbf{X} \in \mathbb{R}^{N \times p}\) of covariance \(\Sigma\), we create an orthonormal basis with a corresponding diagonal covariance matrix \(\Sigma_{PCA}\) with its entries being the eigenvalues of corresponding PCs. As we sequentially construct the components, the eigenvalues \(\lambda_1, \lambda_2, ... ,\lambda_p \in \mathbb{R}\) of the new matrix will be a decreasing sequence. The method of percentage of variance explained is simply to select the first \(r\) components such that \(\frac{1}{V}\sum_{i=1}^r \lambda_i \ge \tau\) for an arbitrary threshold \(\tau \in [0,1]\) and a total variability \(V = \sum_{i=1}^p \lambda_i = tr(\Sigma_{PCA})\). Below are a few examples on the method on \(D_0\):

Eigenvalues of the PCA covariance matrix, with a treshold of 90% for $D_0$ w.r.t noise

Figure 4.1: Eigenvalues of the PCA covariance matrix, with a treshold of 90% for \(D_0\) w.r.t noise

4.2 Scree-plot method

The Scree-plot is again related to the eigenvalues of the newly created orthonormal basis of PCs. As shown in the above, the eigenvalues corresponding to the PCs are a decreasing sequence, as seen on the graph above. The Scree-plot method consists of choosing the value of \(r\) which corresponds to the elbow of the eigenvalue graph. With \(100\) and \(8\) variables we get an overview to which extent every single value contributes to the variance of the data set as a whole.

Scree plot on $D_0$ with different noises as indicated by 1,2 and 3

Figure 4.2: Scree plot on \(D_0\) with different noises as indicated by 1,2 and 3

We can clearly see that the first three plots singular values are much bigger then the latter ones. In every plot the first three singular values only decrease slightly until the \(4\)th, where we can observe a strong drop. From the \(4\)th on the values decrease in the same manner as the first \(3\). In this circumstances the [Scree-plot method (non-automatic)] would favor the \(4\)th value as this value initiates the elbow characteristic of the graphs in every single one of the graphs. This selection would unfortunately be the wrong one as by construction, e.g. truncation of our data set we were expecting the third singular value to initiate the “elbow” form.

As seen below, the “Elbow-Plot” of our data set with a random matrix as covariance has a somehow similar shape. The first eigenvalue however explains much more on the overall variance of our data set. It drops strongly until the second, where the graph seems to stabilize. This behaviour holds on until the third singular value, whereafter we observe a similar drop and decreasing behaviour as in the previous plot. As we already concluded in Figure 4.2, the [Scree-plot method (non-automatic)] would favor a wrong value.

Scree plot on $D_1$ with structured matrix as covariance with different noises as indicated by 1,2 and 3

Figure 4.3: Scree plot on \(D_1\) with structured matrix as covariance with different noises as indicated by 1,2 and 3

The scree plot of \(D_2\), however looks more promising as we get a sharp drop after the first value for every kind of added noise. Therefore choosing the optimal rank \(r\) according to the [Scree-plot method (non-automatic)] would coincide with the true rank of data set (which was by construction \(2\) regardless of the amount of observations and variables) ++TRUE?!

Scree plot on $D_2$ with structured matrix as covariance with different noises as indicated by 1,2 and 3

Figure 4.4: Scree plot on \(D_2\) with structured matrix as covariance with different noises as indicated by 1,2 and 3

4.3 Comparison with CV methods

Compared to CV methods described in Cross-validation on PCA methods, those methods are relatively easy to implement, efficient as there is only a eigenvalue decomposition of the covariance matrix which is computationnally expensive and pretty interpretable. However, those methods are not well generalizable if there is some missing values in the data set or if the eigenvalues do not follow a clear elbow pattern (for the scree-plot).

5 Conclusion

In conclusion, several methods for choosing the optimal number of components have been presented. We were able to observe the behavior of those CV methods depending on the noise, the structure of covariance matrices and the truncated dimensions thanks to simulations on various data sets. We concluded that most methods are able to give accurate predictions in very specific and especially unnoisy circumstances. We have seen that the amount of noise is crucial for a method in order to get useless, e.g. in Matrix Completion in 3.1 or Matrix Completion in 3.2. Not only the added noise but also the underlying covariance matrix impacts the quality of the ouput of the CV-Methods as we were able to see in Figure 3.3. Moreover, we had to assume that our data set was a centered multivariate Gaussian for every method to be comparable, however, some methods generalize better to other type of data set’s distribution (Matrix Completion for example) while others are at least the way we made use of them restricted to this special distribution Wrong PCA Improved or Missing data approach. While comparing these function and having set a “low” amount of observations, variables and simulations we hadn’t that much of a struggle concerning computation time. Nevertheless we want to mention, that the last three methods we presented, hence the Wrong PCA Improved, the Matrix completion method and the KDE modified approach are not very “efficient” algorithms.

To get a better feeling on how the different parameters choosen in the simulations play together we want to encourage the reader to try out our Shiny.app